(54) System and method for extracting data from semi-structured text



(57) An information extractor for extracting subsets 
of text sequences having certain attributes from a semi- 
structured text sequence. The information extractor re- 
sembles a finite state transducer that includes a number 
of states and transition paths between the states. Each 
state is defined to be associated with the extraction of 
subsets of tokens having a certain attribute. Each tran- 
sition path represents a possible transition from one 
subset of tokens having a certain attribute to another 
subset of tokens having another attribute. Contextual 
rules are used to determine when a transition occurs 
and which transition path to take. By using contextual 
rules that consider the token patterns before and after 
a token, the information extractor can process documents
with irregular layout structures and attributes having
a greater variety of permutations.

The state transition diagram, used in one embodiment
of the information extractor, represents the different
sequences of states that the extractor may go through
when it extracts attributes U, N, A, and/or M from an entry
in the input text sequence. The extractor starts in state
b/e 202, then reads a token and determines whether the
token belongs to an attribute. For example, the first
attribute within an entry must be either U or N. Hence,
there is one transition path 218 from state b/e 202 to
state U 204, and another transition path 220 from state
b/e 202 to state N 208. If the token does not belong to
either attribute U or N, then the extractor remains at
state b/e 202, shown by the path 224.
Description

BACKGROUND INFORMATION 

[0001] The present invention relates to extraction of 
information from text documents. 
[0002] Many organizations have large and increasing 
numbers of electronic files that contain information of 
great value. The World Wide Web ("Web") itself is a
huge depository of such information. Tools that enable 
computers to search and analyze this vast amount of 
information are being developed. However, unlike tab- 
ular information typically stored in databases today, 
many documents have only limited internal structure, 
and thus are called "semi-structured text". For example, 
a document relating to faculty member information may 
include a list of entries having words or strings that have 
certain attributes, such as the faculty name, personal 
Web page address, contact address, and academic title, 
etc. The information may be formatted into a tabular list 
for easy viewing. 

[0003] An example of semi-structured text is a Web 
page written in Hyper Text Markup Language (HTML). 
HTML uses tags that specify the formats and character- 
istics of graphics and characters in the Web page. For 
example, a pair of tags "<B>" and "</B>" specifies that
the text lying between them is to be displayed in bold.
These HTML tags within the document are directed 
mainly toward formatting the output style, but the tags 
themselves may not tell the computer what content is 
between the tags. For example, given an input string
"<B> John </B>", the tags "<B>" and "</B>" do not
indicate to the computer whether "John" corresponds to 
a name, a contact address, or an academic title. 
[0004] A program written to extract information from 
documents or text sequences is called an information 
extractor or a "wrapper". An information extractor may 
be adapted to extract a particular set of words or strings 
having certain attributes from documents or text se- 
quences that have a particular structure. For example, 
an information extractor may be adapted to process a 
Web page such as the one shown in Fig. 1 (a) with four 
entries. Referring to Fig. 1 (b), the HTML text sequence 
for that Web page has four entries, each entry including 
a combination of text strings having attributes "URL (U)",
"Name (N)", "Academic Title (A)", and "Administrative
Title (M)". Referring to Fig. 1 (c), the first data tuple to
be extracted from the first entry includes four text strings
having attributes U, N, A, and M. The second and third
data tuples each contain three text strings having attributes
U, N, and A. The fourth data tuple contains text
strings having attributes N and A. For simplicity, a "text
string having attribute U" will simply be referred to as
"attribute U". Thus, the term "attribute" will refer to a text
sequence having a particular attribute.
[0005] To extract attributes from the text sequence of 
a Web page, a preprocessor first segments the text sequence
into tokens. A token may contain words, numbers,
HTML tags, etc. For example, in Fig. 1(b), "<H1>"
may be a token, while the word "Faculty" may represent
another token. An attribute may comprise several tokens.
For example, the Name attribute "Mani Chandy"
may contain two tokens: "Mani" and "Chandy". The information
extractor reads the tokens, and applies a set
of predefined extraction rules to extract attributes from
these tokens. For example, one possible rule may be
that whenever a pair of HTML tags "<B>" and "</B>" is
encountered, the text between them is extracted as the
Name attribute.
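
The segmentation and the simple "<B>" rule described above can be sketched as follows. This is only an illustrative sketch in Python, not the preprocessor of the invention: the regular expression TOKEN_RE and the helper names tokenize and extract_bold are assumptions made for the example, chosen so that an HTML tag head, a word, a number, a run of blanks, a new line, or a single punctuation mark each become one token.

```python
import re

# Hypothetical token pattern (an assumption, not taken from Fig. 4):
# an HTML tag head such as "<DT>", "</A>" or "<A", a run of letters,
# a run of digits, a new line, a run of blanks, or any other single character.
TOKEN_RE = re.compile(
    r"</?[A-Za-z][A-Za-z0-9]*>?|[A-Za-z]+|[0-9]+|\n|[ \t]+|.", re.DOTALL
)

def tokenize(text):
    """Return (token, offset) pairs that together cover the whole input."""
    return [(m.group(0), m.start()) for m in TOKEN_RE.finditer(text)]

def extract_bold(tokens):
    """Naive rule from the text: whatever lies between <B> and </B> is a Name."""
    names, current = [], None
    for tok, _offset in tokens:
        if tok == "<B>":
            current = []                      # start collecting a new Name
        elif tok == "</B>" and current is not None:
            names.append("".join(current).strip())
            current = None
        elif current is not None:
            current.append(tok)
    return names

sample = '<DT> <A HREF="http://www.cs.caltech.edu/people/mani.html">'
tokens = tokenize(sample)
names = extract_bold(tokenize("<B> John </B>"))
```

Each token is returned with its character offset, so a later stage can report where an attribute starts and ends in the original text sequence.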

[0006] Extraction rules are typically generated by a
separate program called a learning program. A learning
program reads a training sample (a short text sequence
from the targeted Web page) with attributes labeled by
a user. For example, a user may use a graphical user
interface (GUI) to highlight or label the parts within the
sample text sequence that correspond to the "Name"
attribute. The learning program then finds the pattern of
locations at which the highlighted or labeled parts appear in
the sample text sequence. That pattern then becomes
the rule for extracting the "Name" attribute.
[0007] After a set of extraction rules is generated, a
user may test these rules on a second sample text sequence
to see if the information extractor can correctly
extract all the attributes. If not, then the learning program
can be invoked again, and the user labels the attributes
that were identified incorrectly with the correct
attribute names. The learning program then modifies the
extraction rules so that they can correctly identify the
attributes in the first and second sample text sequences.
The learning program is also used to update the extraction
rules when a targeted Web page changes its format
or data structure. Because it is not uncommon for Web
pages to be modified often, a learning program that requires
a minimal set of training samples is desired.
[0008] When extracting attributes from a document,
different types of information extractors may use different
extraction rules. For example, one information extractor
called the "LR-wrapper" uses rules for finding tokens
with left-right pair structure. Another information
extractor, often referred to as the "stalker-wrapper", may
have rules for skipping certain types of tokens and
searching for certain kinds of "landmark" tokens.

SUMMARY 

[0009] The present invention is directed to an information
extractor that searches for attributes within a
semi-structured input text sequence. In general, the information
extractor resembles a finite state transducer
that includes a number of states and transition paths between
the states. Each state is defined to be associated
with the extraction of a certain attribute. The allowable
transition paths between the states are associated with
the possible permutations of the attribute sequences.
For each transition path, there is a corresponding contextual
rule. When a pattern in the input text sequence
satisfies a particular contextual rule, a transition between
states occurs, and the information extractor enters
the next state, which is associated with the extraction
of another attribute. By use of contextual rules that
also consider the tokens before and after the current
input token, the information extractor of the present invention
can process documents with irregular layout
structures and attributes having a greater variety of permutations.

[0010] The invention features an information extractor
that can automatically extract attributes from a greater
variety of document styles, and update the extraction
rules in an efficient manner. Unlike other information extractors
that use a rigid set of rules, and thus can only
process documents with attributes arranged in a certain
structure, the present invention can adapt to variations
in document format styles.

[0011] One advantage of the present invention is that
the information extractor may be able to identify the attributes
correctly even when there are missing attributes,
or if the order of attributes varies.
[0012] Another advantage of the present invention is 
that the information extractor is suitable for processing 
documents having an irregular data structure or having 
a tabular layout style. 

[0013] An additional advantage of the present invention
is that the information extractor uses a learning program
that requires a minimal set of training samples to
generate the extraction rules.

DESCRIPTION OF DRAWINGS 

[0014] Fig. 1 (a) shows the output of a sample Web 
page relating to a faculty and research staff Web site. 
[0015] Fig. 1 (b) shows the HTML text sequence 
(source code) for the Web page of Fig. 1 (a). 
[0016] Fig. 1(c) shows the attributes extracted from 
the HTML text sequence of Fig. 1 (b). 
[0017] Fig. 2 is an example of a state transition dia- 
gram used in one embodiment of the information extrac- 
tor of the present invention. 

[0018] Fig. 3 shows a list of possible transition paths. 
[0019] Fig. 4 shows exemplary token types that may 
be used by an information extractor of the present in- 
vention. 

[0020] Fig. 5 shows a sample text sequence segment- 
ed into tokens using the token types listed in Fig. 4. 
[0021] Fig. 6 shows the HTML text sequence (source 
code) for a sample Web page. 

[0022] Fig. 7 shows a state transition diagram used in 
an alternative embodiment of the information extractor 
of the present invention. 

[0023] Fig. 8 is an example of a state transition dia- 
gram used in conjunction with the alternative embodi- 
ment of the present invention. 

[0024] Fig. 9 (a) shows the HTML text sequence of 
Fig. 1 (b) with arrows indicating the borders of the body 
containing the attributes. 



[0025] Fig. 9 (b) shows the HTML text sequence of 
Fig. 1 (b) with arrows indicating the borders of the at- 
tributes. 

[0026] Fig. 10 is an example of a tagged-list document.

[0027] Fig. 11 is an example of a state transition dia- 
gram used in an embodiment of the present invention 
for extracting attributes from a tagged-list document. 
[0028] Fig. 12 (a) is an example of an attribute list extracted
from the tagged-list document of Fig. 10.

[0029] Fig. 12 (b) is an example of a list of Tag and
Value attributes extracted by an information extractor
using the state transition diagram of Fig. 11.
[0030] Fig. 13 shows a sample text sequence with attributes
enclosed in boxes.

[0031] Fig. 14 is an example of a flow diagram for gen- 
erating transition rules used by the information extractor 
of the present invention. 

[0032] Fig. 15 (a) is a sample text sequence having
tokens A7, ... A20.

[0033] Fig. 15 (b) shows the scope of state X and state
N within the sample text sequence of Fig. 15(a).
[0034] Fig. 15 (c) shows the definition of the left and
right context.

[0035] Fig. 16 shows the token class definitions used
in an embodiment of the learning program according to 
the present invention. 

[0036] Fig. 17 (a) shows a first set of contextual rules.
[0037] Fig. 17 (b) shows a second set of contextual
rules.

[0038] Fig. 17 (c) shows a third set of contextual rules.
[0039] Fig. 18 shows a flow diagram for generating a 
contextual rule in accordance with an embodiment of the 
present invention. 

[0040] Fig. 19 shows a set of contextual rules for extracting
the attributes Name, Academic_title, and
Admin_title from a text sequence.
[0041] Fig. 20 shows the sample text sequence of Fig.
13 with the first token of the first attribute of each entry
labeled B1, B2, B3, and B4.

DETAILED DESCRIPTION 

[0042] Referring to Fig. 2, a state transition diagram
200 represents the state transition rules used in an embodiment
of the information extractor. In this example,
the information extractor is a single pass information
extractor suitable for extracting attributes URL (U),
name (N), academic title (A), and administrative title (M)
from a text sequence. The input text sequence may
come from a Web page such as the one shown in Fig.
1 (a), or it may be a document generated by a word processor
having similar attributes. The input text sequence
is first divided into tokens before being sent to the information
extractor. Thus, the information extractor receives
the input text sequence along with a sequence
of offset numbers indicating the location of each token.
The information extractor can extract attributes within
an entry having the sequences of (U, N, A, M), (U, N,
A), or (N, A). If an input text sequence has different attributes,
or has different permutations of attributes, then
the single pass information extractor will also have a different
set of state transition rules.
[0043] The term "attribute" is used to refer to a set of
consecutive tokens that is a subset of the token sequence
having the same characteristics. For example,
an attribute U may represent a set of consecutive tokens
that form a URL address. A token that is part of the
URL address belongs to the attribute U. Similarly, the
attribute N may represent a set of consecutive tokens
that form a person's name. The term "state" is used as
a shorthand or label for describing the actions of an information
extractor. The term "extractor" as used herein
is intended to mean information extractor. When the extractor
is in state U, it means that the extractor is performing
functions associated with extracting attribute U.
Extracting an attribute may include the actions of reading
a sequence of tokens and concatenating them to
form a single string. When the extractor enters state U',
it means the extractor is performing functions associated
with reading dummy tokens located between attribute U
and the next attribute. A dummy token is a token that
does not belong to an attribute.

[0044] When the extractor is in state b/e (abbreviation 
for begin/end), it means the extractor is reading tokens 
that are between the last attribute of an entry and the 
first attribute of the next entry. The state b/e can also 
mean that the extractor is performing functions associ- 
ated with reading the dummy tokens before the first at- 
tribute in the input text sequence. The state b/e may also 
mean that the extractor is performing functions associ- 
ated with reading the dummy tokens after the last at- 
tribute in the input text sequence. In certain applications, 
a GB (abbreviation for Global Begin) state is defined to 
be associated with the actions of reading the tokens before
the first attribute. Likewise, a GE (abbreviation for
Global End) state is defined to be associated with the actions
of reading the tokens after the last attribute in the text
sequence.

[0045] The term "contextual rule" refers to comparing 
the context of a token with a set of predetermined token 
patterns to see if there is a match. The context of a token 
includes the token under consideration, and possibly the 
tokens before and/or after it. For example, suppose the 
input text sequence is:

<DT> <A HREF="http://www.cs.caltech.edu/people/mani.html">

and the token "http" identifies a transition
from state b/e to state U. A contextual rule may then be:

Transfer from state b/e to state U if

Left context = "HREF="" and

Right context = "http://".
[0046] The reference point for the left and right context
is the location immediately preceding the token under
consideration ("http"). Thus, the "left context" refers
to the token(s) before "http", and the "right context" refers
to "http" and a few additional tokens: ":", "/", and
"/". The specific number of tokens in the left context and
the right context varies with the contextual rules. For example,
a contextual rule may require the left context to
include three tokens and the right context to include only
one token. Another contextual rule may require the right
context to include two tokens, while not requiring a comparison
of the left context.
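
The split into left and right context can be sketched as below. This is a minimal Python sketch under stated assumptions: tokens are held in a plain list, the function names contexts and rule_matches are hypothetical, and patterns are compared as literal token strings rather than as the token classes introduced later in the description.

```python
def contexts(tokens, i, n_left, n_right):
    """Split around the point just before tokens[i].

    The left context is the n_left tokens before token i; the right
    context is token i itself plus the following n_right - 1 tokens.
    """
    left = tokens[max(0, i - n_left):i]
    right = tokens[i:i + n_right]
    return left, right

def rule_matches(tokens, i, left_pattern, right_pattern):
    """True if both contexts of token i equal the given literal patterns.

    An empty left_pattern means the left context is not compared.
    """
    left, right = contexts(tokens, i, len(left_pattern), len(right_pattern))
    return left == left_pattern and right == right_pattern

# Tokens of the example line; token 8 in the text is index 7 here.
tokens = ["<DT>", " ", "<A", " ", "HREF", "=", '"',
          "http", ":", "/", "/", "www"]
left, right = contexts(tokens, 7, 3, 4)
```

Because the number of tokens on each side is taken from the rule itself, one rule may look at three tokens to the left and one to the right, while another ignores the left context entirely.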

[0047] Within an entry, we define a dummy attribute
U' to include all tokens that are between attribute U and
the next attribute. Likewise, we define a dummy attribute
N' to include all the dummy tokens between attribute N
and the attribute A. Dummy attribute A' includes dummy
tokens between attribute A and M.
[0048] The state transition diagram in Fig. 2 represents
the different sequences of states that the extractor
may go through when it extracts attributes U, N, A, and/or
M from an entry in the input text sequence. The extractor
starts in state b/e 202, then reads a token and
determines whether the token belongs to an attribute.
In this example, the first attribute within an entry must
be either U or N. Hence, there is one transition path 218
from state b/e 202 to state U 204, and another transition
path 220 from state b/e 202 to state N 208. If the token
does not belong to either attribute U or N, then the extractor
remains at state b/e 202, shown by the path 224.
[0049] Referring to Fig. 3, the possible state transitions
for the extractor defined by the state transition diagram
of Fig. 2 are shown. For each possible state transition
listed in Fig. 3, there is a corresponding transition
path in Fig. 2. The transition from state U 204 to state
U' 206 means that the end of attribute U is detected, and
that the current token (and possibly the tokens afterwards)
is a dummy token that does not belong to any attribute.
The transition from state U' 206 to state N 208
means that the beginning of attribute N is detected, and
that the current token (and possibly the tokens afterwards)
belongs to attribute N. When the information extractor
extracts attributes U, N, and A from the second
entry in Fig. 1(b), it goes through the states b/e - U -
U' - N - N' - A - b/e. After the last attribute (A) within an entry
is extracted, the extractor transfers to state b/e 202,
meaning that the current token is a dummy token located
before the first attribute of the next entry.
[0050] There are two transition paths leaving state
b/e 202 in Fig. 2. The extractor determines which path to
take by use of contextual rules. Each path has a corresponding
contextual rule. If the context of a token matches
the contextual rule of a transition path, then the state
is transferred according to that contextual rule. For example,
the contextual rule associated with the transition
of state b/e 202 to U 204 may be

Rule 1: Transfer from state b/e to state U if

Left context = Calph(HREF) Punc(=) Punc(") and

Right context = Oalph(http) Punc(:) Punc(/) Punc(/).

(Note: The token class Calph() represents strings having
all capital letters. The token class Punc() represents
punctuation marks. The token class Oalph() represents
strings having all lower case letters. These are described
in more detail in Fig. 4.) Rule 1 requires
the left context to be the string "HREF="", and the right
context to be the string "http://". The contextual rule associated
with the transition of state b/e 202 to N 208
may be

Rule 2: Transfer from state b/e to state N if

Left context = Html(<B>) and

Right context = C1alph(_).

(Note: The token class Html() represents all HTML tags,
and the token class C1alph() represents strings that begin
with a capital letter and have at least one lower case
letter. The symbol "_" denotes the wildcard character,
thus any string that begins with a capital letter and has
at least one lowercase letter will match the token pattern
C1alph(_).)

[0051] If the context of a token satisfies Rule 1, then
the transition path from state b/e 202 to state U 204 is
taken. On the other hand, if the context of a token satisfies
Rule 2, then the transition path from state b/e 202
to state N 208 is taken. Otherwise, the extractor remains
in state b/e 202 and continues reading tokens until one
of the two rules is satisfied.
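
The choice between the two paths leaving state b/e can be sketched as follows. This is a hedged Python sketch, not the claimed implementation: the regular expressions standing in for the token classes Html, Calph, Oalph, C1alph, and Punc are assumptions (Fig. 4 is not reproduced here), and the names CLASSES, RULES_FROM_BE, and next_state_from_be are hypothetical. A pattern item is a (class, literal) pair, where a literal of None plays the role of the wildcard "_".

```python
import re

# Assumed regexes for the token classes named in Rules 1 and 2.
CLASSES = [
    ("Html", re.compile(r"</?[A-Za-z][A-Za-z0-9]*>?")),
    ("Calph", re.compile(r"[A-Z]+")),                    # all capital letters
    ("Oalph", re.compile(r"[a-z]+")),                    # all lower case letters
    ("C1alph", re.compile(r"[A-Z][A-Za-z]*[a-z][A-Za-z]*")),
    ("Punc", re.compile(r"[^\sA-Za-z0-9]")),             # one punctuation mark
]

def token_class(tok):
    """Return the name of the first class whose pattern matches the whole token."""
    for name, rx in CLASSES:
        if rx.fullmatch(tok):
            return name
    return "Other"

def item_matches(item, tok):
    cls, literal = item                 # literal None acts as the wildcard "_"
    return token_class(tok) == cls and (literal is None or tok == literal)

def context_matches(pattern, toks):
    return len(pattern) == len(toks) and all(
        item_matches(p, t) for p, t in zip(pattern, toks))

# (target state, left context pattern, right context pattern)
RULES_FROM_BE = [
    ("U", [("Calph", "HREF"), ("Punc", "="), ("Punc", '"')],
          [("Oalph", "http"), ("Punc", ":"), ("Punc", "/"), ("Punc", "/")]),
    ("N", [("Html", "<B>")], [("C1alph", None)]),
]

def next_state_from_be(tokens, i):
    """Return "U" or "N" if Rule 1 or Rule 2 fires at token i, else stay in "b/e"."""
    for target, left, right in RULES_FROM_BE:
        lo = i - len(left)
        if (lo >= 0 and context_matches(left, tokens[lo:i])
                and context_matches(right, tokens[i:i + len(right)])):
            return target
    return "b/e"
```

Calling next_state_from_be on successive token positions reproduces the behavior described above: the extractor stays in state b/e until one of the two rules is satisfied.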

[0052] In some applications, it may be feasible to add 
a GB state before the first b/e state, and a GE state after 
the last b/e state. This is because the left/right context 
of the first token of the first attribute in the first entry may 
be different from the left/right context of the first token 
of the first attribute in the second entry. Likewise, the 
left/right context of the last token of the last attribute in 
the last entry may be different from the left/right context 
of the last token of the last attribute in the second-to-last
entry. 

[0053] Referring to Fig. 4, the token classes used to 
tokenize a text sequence are described. The first col- 
umn lists the name of the token class, the second col- 
umn lists the pattern of the token, the third column de- 
scribes the token type, and the last column lists the 
length of the token. Other token classes may be defined 
depending on the particular application. The learning 
program described later may use different token classes 
to generate the left/right context patterns. 
[0054] Referring to Fig. 5, the text sequence of the
second line of the HTML source code shown in Fig. 1
(b) is segmented into tokens. The first token is an HTML
tag "<DT>", the second token is a space character " ",
and the third token is an HTML tag "<A", etc. The eleventh
token is the punctuation "/", the twelfth token is the
string "www", and the thirteenth token is the punctuation
".", etc.

[0055] The left context of token 1 ("<DT>") is a new
line character (which is the end of the previous line in
Fig. 1 (b), and is not shown in Fig. 5). Thus, token 1 does
not satisfy Rule 1 or Rule 2. Tokens 2, 3, 4, 5, 6, and 7
also do not satisfy Rules 1 and 2. Token 8 satisfies Rule
1 because the left context of token 8 is "HREF="", and
the right context of token 8 is "http://". Therefore, at token
8, the extractor transfers from state b/e 202 to state
U 204. Note that under our definition, the separation
point of left/right context for token 8 is between the quotation
mark (") and the character "h".

[0056] The contextual rules require that the extractor
"look ahead" and read additional tokens in order to
make a comparison of the right context. For example,
Rule 1 requires that the extractor read an additional
three tokens to make the comparison of the right context
because "http://" includes four tokens. The specific
number of additional tokens that need to be read depends
on the particular set of contextual rules. Initially,
the extractor may read a number of tokens required by
the contextual rules.

[0057] Another example of a contextual rule that identifies
a transition from any other state to state A may be:

Rule 3: Transfer to state A if

Left context = Html(<DD>) Html(<I>) and

Right context = C1alph(_)

Rule 3 requires that the left context is an HTML tag
"<DD>" followed by another HTML tag "<I>". In addition,
the right context must be a string that begins with a capital
letter and has at least one lower case letter.
[0058] Referring to Fig. 6, a Web page having content
and layout format slightly different from the Web page
of Fig. 1 (a) has a slightly different HTML text sequence
(or source code) from that in Fig. 1(b). Accordingly, an
extractor will use different contextual rules to identify the
transitions between states associated with extracting
different attributes from the Web page. The contextual
rule for transition from any other state to state A in this
example may be:

Rule 4: Transfer to state A if

Left context = (Html(</A>) Punc(,) Spc(_) Html(<I>) or
Punc(_) NL(_) Spc(_) Html(<I>) or
Punc(,) Spc(_) Html(<I>)) and

Right context = C1alph(_)

Rule 4 requires that the left context is either (i) an HTML
tag "</A>", followed by a comma, any number of spaces,
and an HTML tag "<I>"; or (ii) any punctuation mark, followed
by a new line character, any number of spaces,
and an HTML tag "<I>"; or (iii) a comma, followed by any
number of spaces, and an HTML tag "<I>". Rule 4 requires
that the right context is a string that begins with
a capital letter, and has at least one lower case letter.
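
A rule whose left context has several alternatives, such as Rule 4, can be sketched as a list of patterns of which any one may match. The Python sketch below is illustrative only: the regular expressions for the token classes are assumptions, a run of spaces is assumed to arrive as a single Spc token, and the names LEFT_ALTERNATIVES and rule4_fires are hypothetical.

```python
import re

# Assumed regexes for the token classes used by Rule 4.
CLASSES = [
    ("Html", r"</?[A-Za-z][A-Za-z0-9]*>?"),
    ("NL", r"\n"),
    ("Spc", r"[ \t]+"),                       # a run of spaces is one token
    ("C1alph", r"[A-Z][A-Za-z]*[a-z][A-Za-z]*"),
    ("Punc", r"[^\sA-Za-z0-9]"),
]

def token_class(tok):
    for name, rx in CLASSES:
        if re.fullmatch(rx, tok):
            return name
    return "Other"

def seq_matches(pattern, toks):
    """Each pattern item is (class, literal); literal None is the wildcard "_"."""
    return len(pattern) == len(toks) and all(
        token_class(t) == cls and (lit is None or t == lit)
        for (cls, lit), t in zip(pattern, toks))

# The three alternative left contexts (i), (ii), and (iii) of Rule 4.
LEFT_ALTERNATIVES = [
    [("Html", "</A>"), ("Punc", ","), ("Spc", None), ("Html", "<I>")],
    [("Punc", None), ("NL", None), ("Spc", None), ("Html", "<I>")],
    [("Punc", ","), ("Spc", None), ("Html", "<I>")],
]
RIGHT = [("C1alph", None)]

def rule4_fires(tokens, i):
    """True if token i starts attribute A according to Rule 4."""
    if not seq_matches(RIGHT, tokens[i:i + len(RIGHT)]):
        return False
    return any(
        i - len(alt) >= 0 and seq_matches(alt, tokens[i - len(alt):i])
        for alt in LEFT_ALTERNATIVES)
```

Keeping the alternatives as data rather than as branches in code means that a learning program could add or drop an alternative without changing the matcher.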
[0059] Referring to Fig. 7, a state transition diagram
700 represents the state transition rules used by another
embodiment of the information extractor. This is a universal
single pass information extractor that has states
b/e, U, N, A, M, and D. It is suitable for extracting attributes
U, N, A, and M from an input text sequence. The
input text sequence may come from a Web page such
as the one shown in Fig. 1(a) or Fig. 6, or it may be a
document generated by a word processor having similar
attributes.

[0060] This extractor can process documents having
attributes U, N, A, and M that appear in any permutation
sequence. For example, the extractor can process an
entry that has attributes (M, A, U, N). Such a sequence 
cannot be processed by the single pass extractor de- 
scribed in accordance with Fig. 2. The diagram in Fig. 
7 is used only as an example. If an input text sequence 
has different attributes, then the universal single pass 
information extractor will also have a different set of 
state transition rules. 

[0061] The extractor can process an entry that has attributes
(U, N, M) by traversing the states b/e - U - D -
N - D - M - b/e. When the extractor is in state D, the
extractor is reading dummy tokens that are located between
attributes. The extractor starts with state b/e 702.
The extractor reads tokens sequentially and remains
in state b/e 702 via path 726 until a token belonging
to attribute U is identified. Then the extractor transfers
to state U 704 via path 712. The extractor reads
tokens and remains in state U 704 via path 728 until a
token that does not belong to attribute U is identified.
The extractor stores (or outputs) the extracted attribute
U, then transfers to state D 712 via path 714. The extractor
reads tokens and remains in state D 712 via path
730 until a token that belongs to attribute N is identified.
The extractor transfers to state N 706 via path 716, extracts
the attribute N, then transfers to state D 712 again
via path 718. Next, the extractor goes to state M 710 via
path 722, extracts attribute M, then goes to state b/e
702 via path 724, ending the cycle of extraction for this
entry.
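
The traversal described above can be sketched as a small state machine. The Python sketch below makes a simplifying assumption that is not part of the description: instead of evaluating contextual rules, it receives each token already paired with the state that the rules would select for it. The function name run_extractor and the sample values (including the administrative title "Chair") are hypothetical and serve only to show the state path.

```python
ATTRIBUTE_STATES = {"U", "N", "A", "M"}

def run_extractor(labeled_tokens):
    """Walk (token, state) pairs and collect attributes per entry.

    The state in each pair stands in for the outcome of the contextual
    rules, which are not modeled here. Returns (entries, state_path).
    """
    state, path = "b/e", ["b/e"]
    entries, entry, buf = [], {}, []
    for tok, nxt in labeled_tokens:
        if nxt != state:
            if state in ATTRIBUTE_STATES:      # leaving an attribute: store it
                entry.setdefault(state, []).append("".join(buf))
                buf = []
            if nxt == "b/e" and entry:         # back to b/e: the entry is done
                entries.append(entry)
                entry = {}
            state = nxt
            path.append(state)
        if state in ATTRIBUTE_STATES:          # concatenate attribute tokens
            buf.append(tok)
    if state in ATTRIBUTE_STATES:              # input ended inside an attribute
        entry.setdefault(state, []).append("".join(buf))
    if entry:
        entries.append(entry)
    return entries, path

labeled = [
    ("<DT>", "b/e"),
    ("http://www.cs.caltech.edu/people/mani.html", "U"),
    ('">', "D"),
    ("Mani", "N"), (" ", "N"), ("Chandy", "N"),
    ("</A>", "D"),
    ("Chair", "M"),                            # made-up administrative title
    ("\n", "b/e"),
]
entries, path = run_extractor(labeled)
```

Values are stored in lists so that an entry with a repeated attribute, such as two administrative titles, keeps both.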

[0062] Referring to Fig. 8, a state transition diagram 
800 represents the state transition rules used by another 
embodiment of the information extractor. This is a multi-pass
information extractor suitable for extracting attributes
U, N, A, and M in an input text sequence. If an
input text sequence has different attributes, then the 
multi-pass information extractor will also have a different 
set of state transition rules. 

[0063] An information extractor utilizing the transition
rules of Fig. 8 has the equivalent of six information extractors
802, 804, 806, 808, 810, and 812, each performing
partial extraction on the input text sequence. The
extractor 802 scans through the input text sequence and
finds the part of the input document, referred to as the
"body", that starts with the first attribute and ends with
the last attribute.

[0064] Referring to Fig. 9 (a), the first arrow pointing 
to "h" at line 2 shows the beginning of the first attribute, 
and the arrow pointing to "e" at line 10 shows the end 
of the last attribute. The extractor 802 finds the offset 
values indicated by the first and second arrows in Fig. 
9 (a), and sends these two offset values along with the 
input text sequence to the extractor 804. 
[0065] The state GB 814 in Fig. 8 represents the state
in which the extractor is performing functions associated
with reading dummy tokens before the first attribute. The
state Body 816 represents the state in which the extractor
is performing functions associated with reading tokens
within the body part, which includes the first and
last attribute and all the text sequence in between. The
extractor enters state GE 818 after the last token of the
last attribute has been read. One contextual rule, designated
as (GB, Body), is associated with the transition
path 824 from state GB 814 to state Body 816. Another
contextual rule, designated as (Body, GE), is associated
with the transition path 826 from state Body 816
to state GE 818.

[0066] The extractor 802 in Fig. 8 starts in state GB
814. It reads a token and compares the context of the
token with the contextual rule (GB, Body). Because the
first attribute of the input text sequence is either attribute
U or attribute N, the contextual rule (GB, Body) actually
comprises the rules (GB, U) and (GB, N). The rule
(GB, U) identifies the transition from state GB 814
to state U (not shown here). The rule (GB, N) identifies
the transition from state GB 814 to state N (not shown
here). If either rule (GB, U) or rule (GB, N) is satisfied,
then the extractor 802 enters state Body 816; otherwise,
it remains in state GB 814. After the extractor enters
the state Body 816, it stores the first token and the
offset value of the start of the first token. It then reads
the next token and compares the context of that token
with the contextual rule (Body, GE). If the rule (Body,
GE) is not matched, the extractor 802 stores the token,
remains in state Body 816, and continues to read the
next token. If rule (Body, GE) is satisfied, then the extractor
802 enters state GE 818, stores the offset value
of the end of the last attribute, and ends the process for
extractor 802. The extractor 804 then carries on the extraction
process.
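
The first pass performed by extractor 802 can be sketched as a three-state scan that returns two offsets. The Python sketch below rests on several assumptions: the tokenizer pattern, the function name find_body, and the two predicates starts_body and ends_body are hypothetical stand-ins for the contextual rules (GB, Body) and (Body, GE), and the sample text is invented for illustration. The end offset is taken to be the offset of the first token after the last attribute.

```python
import re

TOKEN_RE = re.compile(
    r"</?[A-Za-z][A-Za-z0-9]*>?|[A-Za-z]+|[0-9]+|\n|[ \t]+|.", re.DOTALL)

def tokenize(text):
    return [(m.group(0), m.start()) for m in TOKEN_RE.finditer(text)]

def find_body(tokens, starts_body, ends_body):
    """Scan once through GB -> Body -> GE and return (state, start, end).

    starts_body(tokens, i) and ends_body(tokens, i) stand in for the
    contextual rules (GB, Body) and (Body, GE) evaluated at token i.
    """
    state, start, end = "GB", None, None
    for i, (_text, offset) in enumerate(tokens):
        if state == "GB":
            if starts_body(tokens, i):
                state, start = "Body", offset   # offset of the first attribute
        elif state == "Body":
            if ends_body(tokens, i):
                state, end = "GE", offset       # offset just past the last attribute
                break
    return state, start, end

# Invented sample page and stand-in rules, for illustration only.
text = "<H1>Faculty</H1>\n<B>Mani</B>, <I>Professor</I>\n<HR>"
tokens = tokenize(text)
starts_body = lambda toks, i: i > 0 and toks[i - 1][0] == "<B>"
ends_body = lambda toks, i: [t for t, _ in toks[i:i + 3]] == ["</I>", "\n", "<HR>"]
state, start, end = find_body(tokens, starts_body, ends_body)
```

The two offsets, together with the unchanged input text, are all that the next pass needs in order to restrict its own scan to the body.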

[0067] Referring to Fig. 9 (b), the arrows indicate the
start and end of attributes, representative of the offset
values generated by the extractor 804. The attributes
are extracted when the extractor 804 is in the state Tuple
820. The state Tuple' 822 refers to the state when
the extractor 804 is reading dummy tokens located between
attributes. The extractor 804 receives the input
text sequence and the two offset values generated by
the extractor 802, and repeatedly extracts the attributes
without identifying whether each is attribute U, N, A, or M.
The output of the extractor 804 is the offset values of
the start and end of all attributes. After finding the offset
values, the four extractors 806, 808, 810, and 812 are
invoked to extract attributes U, N, A, and M, respectively.
[0068] The single pass and multi-pass information extractors
of the present invention can process documents
having missing attributes. For example, one faculty
member may not have an administrative title, while another
may not have a URL address. The extractors of
the present invention can also process documents having
multiple attributes, e.g., a faculty member may have two or
more administrative titles. The extractors may also process
documents having variant attribute ordering. For example,
a Web page may have several entries with the
academic title appearing before the administrative title,
but also have one entry in which the academic title appears
after the administrative title.
[0069] Referring to Fig. 10, a tagged-list text
sequence 1000 has five entries. Each entry includes
attributes "Name", "E-Mail", "Last Update", "Aft. Name",
"Organization", and/or "Service Provider". The text
sequence 1000 is different from the text sequence in Fig.
1 (b) because text sequence 1000 contains the attribute
names (herein referred to as "tags") for each attribute
(herein referred to as "tag values"). An extractor can
extract the attribute names along with the attributes
("tag-and-value pairs"), and then apply a post-processing
step to convert the tag-and-value pairs into data tuples
of attribute values.

[0070] Referring to Fig. 11, a state transition diagram
1100 represents the state transition rules used by an
embodiment of the information extractor for extracting
a tagged-list text sequence. The extractor is suitable for
extracting attributes Tag and Value (Val). When the
extractor is in state Tag 1102, it is extracting tokens that
belong to attribute Tag, such as "Name" or "E-Mail".
When the extractor is in state Val 1106, it is extracting
tokens that belong to attribute Value, such as " 'Lithium'
J Smith", or "aulmer@u.Washington.edu", etc. When
the extractor is in state Tag' 1104 or state Val' 1108,
it is extracting dummy tokens.

[0071] Referring to Fig. 12 (a), the attributes sought
to be extracted from the first entry in the text sequence
of Fig. 10 are "Name", "E-Mail", "Last Update", and
"Organization". Fig. 12 (b) shows the Tag and Value
attributes extracted by an information extractor using the
state transition diagram of Fig. 11. Note that the
extractor does not differentiate between Tag or Value
attributes that have different contents. A post processor
can then transform the Tag and Value output in Fig. 12
(b) into the attribute list format in Fig. 12 (a).
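
By way of illustration only, such a post-processing step may be sketched in Python as follows. The function name `pairs_to_record` and the sample pairs are hypothetical and are not taken from Fig. 10 or Fig. 12; the sketch assumes the extractor emits (Tag, Value) pairs for one entry and that the set of sought tags is known.

```python
def pairs_to_record(pairs, wanted):
    """Keep only the sought tags and map each tag to its value."""
    record = {}
    for tag, value in pairs:
        tag = tag.strip()
        if tag in wanted:
            record[tag] = value.strip()
    return record


# Hypothetical extractor output for one entry: (Tag, Value) pairs.
pairs = [
    ("Name", "J Smith"),
    ("E-Mail", "jsmith@example.org"),
    ("Service Provider", "Example Net"),
]
record = pairs_to_record(
    pairs, {"Name", "E-Mail", "Last Update", "Organization"}
)
```

Tags that are not sought (here "Service Provider") are dropped, and sought tags missing from the entry are simply absent from the record.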

Generating the Transition Paths 

[0072] In accordance with the present invention, a
learning program is provided to generate the transition
paths used by an information extractor. In one embodiment,
a sample text sequence representative of a target
document is first given to the learning program. A user
highlights the attributes within the sample text sequence,
and then identifies which highlighted part is associated
with which attribute. In the description hereafter, the
HTML sequence in Fig. 1 (b) will be used as the sample
text sequence.

[0073] Referring to Fig. 13, a user first uses a GUI to
highlight the attributes on a display screen. Here, the
attributes are enclosed in boxes. Then the learning
program interactively prompts the user to enter the
attribute name for each highlighted part. From these
highlighted parts and the attribute names provided by the
user, the learning program can identify the different
attribute combinations within an entry. For example, the
learning program may identify the sample text sequence to
have four entries, the entries having the data tuples
(U, N, A, M), (U, N, A), (U, N, A), and (N, A), respectively.
Other methods for labeling the attributes can be used to
accommodate the particular GUI or other kinds of input
device.
[0074] Referring to Fig. 14, a flow diagram 1400
represents the process of generating a set of transition
paths. The learning program starts at block 1402. At
block 1404, the learning program checks the first
attribute of each data tuple. In this example, the first
attribute is either U or N. For each such attribute i that
is the first attribute of a data tuple, a transition path is
created from state b/e to state i. Thus, transition paths
(b/e -> U) and (b/e -> N) are created.
[0075] At block 1406, the learning program finds all
possible attribute pairs that come in sequence. The
possible attribute pairs are (U, N), (N, A), and (A, M). At
block 1406, for each attribute pair (j, k) that comes in
sequence, a transition path is created from state j to
state j' and from state j' to state k. State j' is a dummy
state, and represents the state in which the extractor is
reading a dummy token after the attribute j. The
transition paths (U -> U'), (U' -> N), (N -> N'), (N' -> A),
(A -> A'), and (A' -> M) are created in block 1406.
[0076] At block 1408, the learning program checks to
see which is the last attribute for each data tuple. Here,
the last attribute can be either A or M. At block 1408, for
each such last attribute m, a transition path is created
from state m to state b/e. The transition paths (A -> b/e)
and (M -> b/e) are created at block 1408. The transition
path generation process is ended at block 1410. A
total of 10 transition paths are created in this example.
These transition paths are consistent with the ones
shown in Fig. 3.
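
By way of illustration only, the path generation of blocks 1404, 1406, and 1408 may be sketched in Python as follows. The function name `generate_transition_paths` and the representation of states as strings (with a trailing apostrophe marking a dummy state) are assumptions; the sample tuples are the four data tuples of the example, read as (U, N, A, M), (U, N, A), (U, N, A), and (N, A).

```python
def generate_transition_paths(tuples):
    """Return the distinct transition paths implied by labeled data tuples."""
    paths = []

    def add(src, dst):
        if (src, dst) not in paths:
            paths.append((src, dst))

    # Block 1404: from state b/e to each attribute that starts a tuple.
    for t in tuples:
        add("b/e", t[0])
    # Block 1406: each adjacent pair (j, k) passes through dummy state j'.
    for t in tuples:
        for j, k in zip(t, t[1:]):
            add(j, j + "'")
            add(j + "'", k)
    # Block 1408: from each attribute that ends a tuple back to b/e.
    for t in tuples:
        add(t[-1], "b/e")
    return paths


sample = [("U", "N", "A", "M"), ("U", "N", "A"), ("U", "N", "A"), ("N", "A")]
paths = generate_transition_paths(sample)
```

For these tuples the sketch yields two paths out of b/e, six paths through dummy states, and two paths back to b/e, matching the total of 10 stated above.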

Generating the Contextual Rules 

[0077] In accordance with the present invention, the
learning program can further generate the contextual
rules used to determine when to enter or leave a state.
The contextual rules are also used to determine which
transition path to take when there are multiple transition
paths branching out of a state. A sample text sequence
with attributes correctly labeled by a user is given to the
learning program. The learning program then generates
the contextual rules by using a "set-covering" algorithm
to cover all "positive token examples" and exclude all
"negative token examples". Positive token examples
are tokens whose context should match the contextual
rules. Negative token examples are tokens whose
context should not match the contextual rules.
[0078] Referring to Fig. 15 (a), a sample text
sequence having tokens A1, A2, ... A20 is used to illustrate
the method of generating a contextual rule. The text
sequence has a Name attribute "Yaser Abu-Mostafa" that
includes tokens A16 to A18. The tokens A1 to A15, and
A19, A20 are dummy tokens.

[0079] Referring to Fig. 15 (b), the extractor is defined
to be in state X when reading dummy tokens before
attribute Name, and to be in state N when reading tokens
that belong to attribute Name. The contextual rule for
identifying the transition from state X to state N can be
written as:

Rule 5: Transfer from state X to state N if

Left context = LP (left context pattern) and
Right context = RP (right context pattern).
[0080] The goal is to find the left context pattern (LP)
and the right context pattern (RP). LP/RP should only
match the left/right context of token A16, and should not
match the left/right context of tokens A1 to A15. Token
A16 is an example of a "positive sample", and tokens
A1 to A15 are examples of "negative samples". A
positive sample is a token that the contextual rule should
correctly match, and a negative sample is a token that
the contextual rule should not match. In general, there
may be more than one positive sample. The contextual
rule should match each one of the positive samples, and
not match any of the negative samples. In this example,
the goal is to find LP/RP such that when the extractor
applies Rule 5 to tokens A1 to A16, the extractor can
correctly identify the transition from state X to state N at
token A16.

[0081] Referring to Fig. 15 (c), the first token of the
right context of token A16 is A16; the second token of
the right context of A16 is A17, and so on. The first token
of the left context of token A16 is A15; the second token
of the left context of A16 is A14, and so on.
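
By way of illustration only, the left and right contexts defined above may be sketched in Python as follows. The function names and the use of token labels "A1" to "A20" in place of the actual tokens of Fig. 15 are assumptions. Note that, under this definition, the right context begins with the token itself, while the left context is read outward from the token immediately before it.

```python
def left_context(tokens, i, k):
    """First k tokens of the left context of tokens[i], nearest first."""
    return tokens[max(0, i - k):i][::-1]


def right_context(tokens, i, k):
    """First k tokens of the right context of tokens[i], starting at tokens[i]."""
    return tokens[i:i + k]


# Token labels stand in for the actual tokens of the sample sequence.
tokens = [f"A{n}" for n in range(1, 21)]
i = tokens.index("A16")
left = left_context(tokens, i, 2)
right = right_context(tokens, i, 2)
```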
[0082] Initially, the learning program takes the first
token of the left context of A16 to generate a left context
pattern. The left context pattern is a token class that
includes the first token of the left context of A16. Because
A15 is an HTML tag "<B>", the left context pattern for
the first token of the left context can be Html(_) or
Ptag(_). The token classes used by the learning program
are listed in Fig. 16.

[0083] Referring to Fig. 17 (a), a tentative contextual
rule may be:

Rule 6: Transfer from state X to state N if

Left context = Html(_).
Any token having an HTML tag as first token of the left
context will satisfy Rule 6. Applying Rule 6 to tokens A1
to A15, the token pattern Html(_) matches the left
context of negative samples A2, A10, A13, and A15. The
positive matching count (p) is defined as the number of
positive samples that are matched, and the negative
matching count (n) is defined as the number of negative
samples that are matched. Here, p = 1 and n = 4.
[0084] Because "<B>" is also a member of the token
class Ptag(_), another tentative contextual rule may be:

Rule 7: Transfer from state X to state N if

Left context = Ptag(_).
The token pattern Ptag(_) matches the left context of
negative samples A2, A10, A13, and A15. Thus p = 1
and n = 4.

[0085] The learning program then takes the first token
of the right context of A16 to generate a right context
pattern. In this example, A16 is a word "Yaser", thus the
right context pattern for the first token of the right context
may be Word(_), CNalph(_), or C1alph(_). These three
token classes all match the token "Yaser".
[0086] Rule 8 uses Word(_) as the right context
pattern. Comparing Word(_) with the right context of tokens
A1 to A15, Word(_) matches the right context of negative
samples A2, A4, A6, and A8. Thus p = 1 and n = 4.
For Rule 9, CNalph(_) also matches the right context of
four negative samples, thus p = 1 and n = 4. For Rule
10, C1alph(_) matches the right context of negative
samples A2, A6, and A8, thus p = 1 and n = 3.

[0087] Comparing Rules 6 to 10, Rule 10 results in
the fewest incorrect matches (matching a negative
sample). Typically the learning program selects the
tentative contextual rule that results in the greatest value
for (p - n)/(p + n). If applying two rules to the relevant
tokens results in the same value for (p - n)/(p + n), then
the rule having the broader token class will be selected as
the basis for finding the complete contextual rule. For
example, the learning program will choose Rule 8 over
Rule 9 because Word(_) has a broader scope than
CNalph(_). The context pattern used in Rule 10 is
incomplete because it still matches some negative
samples.
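
By way of illustration only, the selection among Rules 6 to 10 may be worked through in Python as follows, using the p and n counts stated in the preceding paragraphs. The function name `score` and the dictionary layout are assumptions; exact fractions are used so that equal scores compare as equal.

```python
from fractions import Fraction


def score(p, n):
    """Selection criterion (p - n) / (p + n) for a tentative rule."""
    return Fraction(p - n, p + n)


# (p, n) counts for the tentative rules, as given in the description.
counts = {
    "Rule 6": (1, 4),   # Left context = Html(_)
    "Rule 7": (1, 4),   # Left context = Ptag(_)
    "Rule 8": (1, 4),   # Right context = Word(_)
    "Rule 9": (1, 4),   # Right context = CNalph(_)
    "Rule 10": (1, 3),  # Right context = C1alph(_)
}
scores = {rule: score(p, n) for rule, (p, n) in counts.items()}
best = max(scores, key=scores.get)
```

Rules 6 to 9 each score (1 - 4)/(1 + 4) = -3/5, while Rule 10 scores (1 - 3)/(1 + 3) = -1/2, so Rule 10 is selected. The sketch does not implement the tie-break on class breadth, which is not needed for this comparison.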

[0088] Referring to Fig. 17 (b), the learning program
finds the next set of tentative contextual rules (Rules 11
to 15) by using C1alph(_) as basis, and adding one token
pattern to either the left or right context pattern. For
Rule 11, Html(_) is added to the left context pattern of
Rule 10. Thus, Rule 11 becomes:

Rule 11: Transfer from state X to state N if

Left context = Html(_) and
Right context = C1alph(_).
At this point, it is only necessary to compare Html(_) with
the first token of the left context of tokens A2, A6, and
A8. Here, Html(_) matches the first token of the left
context of A2, thus p = 1 and n = 1. For Rule 12, Ptag(_)
also matches the first token of the left context of A2, thus
p = 1 and n = 1.

[0089] For Rules 13 to 15, the learning program
expands the right context pattern by adding a token pattern
that matches the second token of the right context of
A16. The second token of the right context of A16 is the
space character " ", which belongs to the token classes
Nonword(_), Ctrl(_), and Spc(_). For Rule 13, Nonword(_)
is added to the right context pattern of Rule 10. Thus,
Rule 13 becomes:

Rule 13: Transfer from state X to state N if

Right context = C1alph(_) Nonword(_).
The learning program compares Nonword(_) with the
second token of the right context of the negative
samples matched by Rule 10. Here, Nonword(_) matches
the second token of the right context of A2 and A6, thus p
= 1 and n = 2. For Rule 14, Ctrl(_) matches the second
token of the right context of A2 and A6, thus p = 1 and
n = 2. For Rule 15, Spc(_) also matches the second
token of the right context of A2 and A6, thus p = 1 and n
= 2. Comparing Rules 11 to 15, Rules 11 and 12 result
in the fewest incorrect matches. Because Html(_) is a
broader class than Ptag(_), the learning program
selects Rule 11 as basis for finding the complete context
pattern for the contextual rule.

[0090] Referring to Fig. 17 (c), the learning program
finds the next set of tentative contextual rules by using
LP = Html(_) and RP = C1alph(_) as basis, and adding
one token pattern to either the left or right context
pattern. The second token of the left context of A16 is
<A HREF="http://electra.caltech.edu/EE/Faculty/Abu-Mostafa.html">.
This token belongs to the token classes
Html(_) and Ptag(_). Rule 16 is generated by adding
Html(_) to the left context pattern of Rule 11. Rule 17 is
generated by adding Ptag(_) to the left context pattern
of Rule 11. Rule 16 is:

Rule 16: Transfer from state X to state N if

Left context = Html(_) Html(_) and
Right context = C1alph(_).
The learning program compares Html(_) to the second
token of the left context of the negative samples
matched by Rule 11. Here, Html(_) does not match any
second token of the left context of the negative samples.
Thus, p = 1 and n = 0. For Rule 17, Ptag(_) also does
not match any second token of the left context of the
negative samples, thus p = 1 and n = 0. Applying Rules
18 to 20 all result in p = 1 and n = 1. Thus, Rules 16 and
17 are better than Rules 18 to 20. Rule 16 is selected as
the final contextual rule because Html(_) is a broader
class than Ptag(_). Therefore, the contextual rule for
identifying the transition from state X to state N is Rule
16.

[0091] Fig. 18 illustrates a flow diagram for generating
a contextual rule in accordance with an embodiment of
the present invention. The learning program starts at
block 1802, and receives input from a user in block 1804
to identify positive and negative samples that identify
transitions from one state to another. An example of a
positive sample for the transition from state X to state N
is the token A16 in Fig. 15 (c), and examples of negative
samples are tokens A1 to A15. In block 1806, the
learning program generates a series of left context patterns
that match the first token of the left context of the positive
samples. Examples of such left context patterns are
Html(_) and Ptag(_). In block 1808, the learning
program generates a series of right context patterns that
match the first token of the right context of the positive
samples. Examples of such right context patterns are
Word(_), CNalph(_), and C1alph(_).
[0092] In block 1810, the learning program compares
the left context patterns generated in block 1806 with
the first token of the left context of negative samples,
and determines the n value representing the number of
negative samples that are matched. The learning
program then compares the right context patterns
generated in block 1808 with the first token of the right
context of negative samples, and determines the n value
representing the number of negative samples that are
matched. The learning program chooses the context
pattern that results in the largest value for (p - n)/(p + n).
The number p represents the number of positive
samples.

[0093] In block 1812, the learning program determines
whether the context patterns generated so far are
complete. If the left/right context pattern selected in
block 1810 does not match any negative samples, then
that context pattern is complete, control is transferred
to block 1816, and the contextual rule is output using
the left/right context patterns generated thus far. In
block 1812, if the left/right context pattern selected in
block 1810 still matches any negative samples, then
control is transferred to block 1814. The left (or right)
context pattern is expanded by adding a token pattern
that matches the next token of the left (or right) context
of the positive samples. Typically, block 1814 requires
generating a series of context patterns and finding the
one that results in the largest (p - n)/(p + n) value, similar
to the actions performed in blocks 1806, 1808, and
1810. Blocks 1812, 1814 and 1810 are repeated until a
left/right context pattern that doesn't match any negative
sample is generated.
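
By way of illustration only, the loop of Fig. 18 may be sketched in Python as follows. The sketch is a simplified greedy procedure under stated assumptions and not a definitive implementation: the four token classes and their tests are hypothetical stand-ins for the classes of Fig. 16, the order of `CLASSES` is assumed to run from broadest to narrowest, and the sample tokens are invented rather than taken from Fig. 15. Because every candidate class must match all positive samples, p is fixed, so choosing the largest (p - n)/(p + n) is the same as choosing the smallest n.

```python
# Hypothetical token classes, assumed to be listed broadest first.
CLASSES = [
    ("Html", lambda t: t.startswith("<") and t.endswith(">")),
    ("Word", str.isalpha),
    ("C1alph", lambda t: t[:1].isupper() and t[1:].islower()),
    ("Spc", str.isspace),
]


def ctx_token(tokens, i, side, depth):
    """Token at 0-based depth in the left ("L") or right ("R") context of i."""
    j = i - 1 - depth if side == "L" else i + depth
    return tokens[j] if 0 <= j < len(tokens) else None


def matches(test, tok):
    return tok is not None and test(tok)


def learn_rule(tokens, positives, negatives):
    """Grow left/right patterns until no negative sample is matched."""
    pattern = {"L": [], "R": []}
    remaining = list(negatives)
    while remaining:
        best = None
        for side in ("L", "R"):
            depth = len(pattern[side])  # next unmatched context position
            for rank, (name, test) in enumerate(CLASSES):
                # A candidate class must match every positive sample.
                if not all(matches(test, ctx_token(tokens, p, side, depth))
                           for p in positives):
                    continue
                hit = [q for q in remaining
                       if matches(test, ctx_token(tokens, q, side, depth))]
                key = (len(hit), rank)  # fewest negatives, then broadest
                if best is None or key < best[0]:
                    best = (key, side, name, hit)
        if best is None:
            raise ValueError("no token class separates the samples")
        _, side, name, remaining = best
        pattern[side].append(name)
    return pattern


# Invented sample: token 9 ("Yaser") is positive, tokens 0 to 8 are negative.
tokens = ["<H1>", "Faculty", " ", "of", " ", "Caltech", "</H1>",
          "<A>", "<B>", "Yaser", " ", "Abu", "</B>"]
rule = learn_rule(tokens, positives=[9], negatives=list(range(9)))
```

On this invented sample the sketch first selects C1alph for the right context, then adds Html twice to the left context, which parallels the progression from Rule 10 to Rule 11 to Rule 16.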

[0094] Referring to Fig. 19, Rules 21 to 28 form a
sample set of contextual rules that can be used by an
information extractor to extract the attributes "Name",
"Academic_title", and "Admin_title" from a text
sequence similar to the one shown in Fig. 1 (b). Rule 21 is
used to identify the transition from state GB to state
b/e. Rule 22 is used to identify the transition from state
b/e to state Name, and so on. The information extractor
using Rules 21 to 28 has eight states: GB, b/e, Name,
@Name, Academic_title, @Academic_title,
Admin_title, and GE. The state @Name refers to a
dummy state in which the extractor is reading dummy
tokens between attribute Name and the next attribute.
Each one of the Rules 21 to 28 can be generated by the
learning program in accordance with the flow diagram
of Fig. 18.

[0095] Referring to Fig. 20, the sample text sequence
of Fig. 13 is shown with each attribute highlighted. For
purpose of illustration, the first tokens of the first
attributes of the four entries are labeled B1, B2, B3, and
B4, respectively. Note that tokens B1, B2, and B3 have
already been identified as belonging to attribute U, and
B4 as belonging to attribute N. The description below
illustrates the process of generating the contextual rules
for an information extractor having the state transition
diagram of Fig. 2. In Fig. 2, there are two paths leaving
state b/e: transition paths (b/e -> U) and (b/e -> N).
[0096] To generate the contextual rule for the
transition path (b/e -> U), the learning program identifies
tokens B1, B2, and B3 as positive samples because they
belong to attribute U. All other tokens are negative
samples. The learning program generates left/right context
patterns that match only tokens B1, B2, and B3, but no
other tokens. To generate the contextual rule for the
transition path (b/e -> N), the learning program first
identifies token B4 as a positive sample because it belongs
to attribute N. All other tokens are negative samples. The
learning program then generates left/right context
patterns that match only token B4, but no other tokens. The
contextual rules for other transition paths can be
generated in a similar way.

[0097] The presently disclosed embodiments are 
considered in all respects to be illustrative and not re- 
strictive. The scope of the invention is indicated by the 
appended claims rather than the foregoing description, 
and all changes that come within the meaning and range 
of equivalents thereof are intended to be embraced 
therein. 



Claims 

1. A method of extracting attributes from a token
sequence, the method comprising the steps of:

identifying the beginning of a first attribute from 
the token sequence by applying a first set of 
contextual rules to sequential ones of the token 
sequence including comparing a left context 
and a right context of a token of the token se- 
quence with a set of predetermined token pat- 
terns to determine whether one of said set of 
predetermined token patterns is satisfied; 
upon identifying the beginning of the first at- 
tribute, storing sequential tokens which follow 
the token associated with the beginning of the 
first attribute until the end of the first attribute is 
identified. 

2. The method of claim 1 wherein identifying the end 
of the first attribute includes applying a second set 
of contextual rules to sequential ones of the token 
sequence after the token associated with the begin- 
ning of the first attribute. 

3. The method of claim 2, after the step of identifying 
the end of the first attribute, further comprising the 
steps of: 

identifying the beginning of a second attribute
from the token sequence by applying a third set
of contextual rules to sequential ones of the
token sequence after the token associated with
the end of the first attribute, including comparing
a left context and a right context of a token
of the token sequence with a set of predetermined
token patterns to determine whether one
of said set of predetermined token patterns is
satisfied;

upon identifying the beginning of the second
attribute, storing sequential tokens which follow
the token associated with the beginning of the
second attribute until the end of the second
attribute is identified; and

identifying the end of the second attribute
including applying a fourth set of contextual rules
to sequential ones of the token sequence after
the token associated with the beginning of the
second attribute.

4. The method of claim 3, wherein the first and second
attributes belong to a predetermined set of
attributes having a predetermined number of
permutation sequences, and the set of predetermined
token patterns used in said third set of contextual
rules include token patterns for identifying a
transition from a token not associated with any attribute
to a token associated with an attribute that can
possibly follow the first attribute according to the
predetermined set of permutation sequences.

5. The method of claim 3, wherein the first and second
attributes belong to a predetermined set of
attributes having a predetermined number of
permutation sequences, and the set of predetermined
token patterns used in said third set of contextual
rules include token patterns for identifying a
transition from a token not associated with any attribute
to a token associated with any attribute belonging
to the predetermined set of attributes.

6. A method of extracting a first subset of tokens
having a first attribute and a second subset of tokens
having a second attribute from a token sequence,
the method comprising the steps of:

identifying the first token of the first subset of
tokens by applying a first set of contextual rules
to sequential ones of the token sequence
including comparing the left and right context of
a token of the token sequence with a set of
predetermined token patterns to determine whether
one of said set of predetermined token patterns
is satisfied;

upon identifying the first token of the first subset
of tokens, storing sequential tokens that follow
the first token of the first subset of tokens until
the last token of the first subset of tokens is
identified.

7. The method of claim 6 wherein identifying the last
token of the first subset of tokens includes applying
a second set of contextual rules to sequential ones
of the token sequence after the last token of the first
subset of tokens.

8. The method of claim 7, after the step of identifying
the last token of the first subset of tokens, further
comprising the steps of:

identifying the first token of the second subset
of tokens by applying a third set of contextual
rules to sequential ones of the token sequence
after the last token of the first subset of tokens,
including comparing the left and right context
of a token of the token sequence with a set of
predetermined token patterns to determine
whether one of said set of predetermined left
and right token patterns is satisfied;

upon identifying the first token of the second
subset of tokens, storing sequential tokens
which follow the first token of the second subset
of tokens until the last token of the second
subset of tokens is identified.

9. A method of extracting a first subset of tokens
having a first attribute and a second subset of tokens
having a second attribute from a token sequence,
the second subset of tokens appearing after the first
subset of tokens, the first and second attribute
selected from a group of predetermined attributes, the
method comprising the steps of:

identifying the first token of the first subset of
tokens by applying a first set of rules to
sequential ones of tokens in the token sequence;

identifying the last token of second subset of
tokens by applying a second set of rules to
sequential ones of tokens after the first token of
the first subset of tokens;

identifying the last token of the first subset of
tokens by applying a third set of rules to
sequential ones of tokens after the first token of
the first subset of tokens;

identifying the first token of the second subset
of tokens by applying a fourth set of rules to
sequential ones of tokens after the last token of
the first subset of tokens;

identifying the attribute associated with the first
subset of tokens; and

identifying the attribute associated with the
second subset of tokens.



10. The method of claim 9, wherein the first set of rules
for identifying the first token of the first subset of
tokens includes rules for comparing a left context
and a right context of a token with a set of
predetermined left and right token patterns to determine
whether one of said set of predetermined left and
right token patterns is satisfied.

11. A method of extracting attributes from a token
sequence, each attribute comprising at least one
token, the method comprising the steps of:

accepting a plurality of rules which characterize
a transition from a first attribute to a second
attribute in the token sequence in terms of the left
context and right context of a token;

accepting the token sequence;

identifying the borders between attributes in the
token sequence by applying the plurality of
rules to the token sequence.

12. The method of claim 11, wherein the transition from
said first attribute to said second attribute includes
a first transition from said first attribute to a dummy
attribute, and a second transition from said dummy
attribute to said second attribute.

13. The method of claim 11, wherein the attributes
within the token sequence belong to a predetermined
set of attributes, and said plurality of rules include
rules for characterizing a transition from one
attribute within the set of attributes to any other
attribute within the set of attributes.

14. The method of claim 11, wherein the attributes
within the token sequence belong to a predetermined
set of attributes, and said plurality of rules include
rules for characterizing a transition from one
attribute within the set of attributes to a limited number
of other attributes within the set of attributes.

15. A method of extracting subsets of tokens from a
token sequence, each subset of tokens having an
attribute selected from a group of attributes, the
method comprising the steps of:

accepting a plurality of rules that identifies the
borders of different subsets of tokens having
different attributes by matching the left context
and right context of a token to a predetermined
set of left and right token patterns;

accepting the token sequence;

identifying the start and end of subsets of
tokens by applying the plurality of rules to
sequential ones of the token sequence; and

storing the subsets of tokens along with the
attributes associated with the subset of tokens.


16. The method of claim 15, wherein the step of identi- 
fying the start and end of subsets of tokens com- 
prises the steps of: 

applying a first subset of the plurality of rules to 
sequential ones of tokens in the token se- 
quence to identify the start of the first subset of 
tokens and the end of the last subset of tokens; 
applying a second subset of the plurality of 
rules to sequential ones of the token sequence 
after the start of the first subset of tokens and 
before the end of the last subset of tokens, to 
identify the start and end of each subset of to- 
kens; 

applying a third subset of the plurality of rules 
to identify the attributes associated with each 
subset of tokens. 

17. A system for identifying attributes in a token
sequence, comprising:

a storage device for storing a plurality of rules
which characterize a transition from a first
attribute to a second attribute in the token
sequence in terms of the left context and right
context of a token;

a processor for accepting the token sequence
and identifying the first and last tokens of
attributes in the token sequence by applying the
plurality of rules to the token sequence.

18. The system of claim 15, wherein the plurality of
rules characterize all possible transitions from a first
attribute to a second attribute, and said processor
applies the plurality of rules to sequential ones of
the token sequence to identify the first and last
token of each attribute in the token sequence within
a single pass of reading the token sequence.

19. The system of claim 15, wherein the plurality of
rules comprises three subsets of rules:

a first subset of rules for identifying the first to- 
ken of the first attribute and last token of the last 
attribute in the token sequence; 
a second subset of rules for identifying the first 
and last token of all attributes without actually 
identifying the individual attributes, and 
a third subset of rules for identifying each indi- 
vidual attribute. 



20. A method of generating a contextual rule used by 
an extractor for identifying transitions between sub- 
sets of tokens having different attributes, compris- 
ing the steps of: 

receiving a sample token sequence including a 
first subset of tokens having a first attribute and 
a second subset of tokens not associated with 
the first attribute, the second subset of tokens 
coming before the first subset of tokens in the 
token sequence; 

identifying the first token of the first subset of 
tokens; 

generating a pair of left and right context pat- 
terns that match the left and right context of the 
first token of the first subset of tokens but not 
the left and right context of the tokens belong- 
ing to the second subset of tokens; 
whereby the left and right context patterns form 
the matching criteria of the contextual rule for 
identifying the transition from the second sub- 
set of tokens to the first subset of tokens. 

21. The method of claim 20, wherein the step of
generating the pair of left and right context patterns
comprises the steps of:

finding a first set of token classes that matches
the first token of the right context of the first
token of the first subset of tokens;

finding a second set of token classes that matches
the first token of the left context of the first token
of the first subset of tokens;

selecting a token class among the first set and
second set of token classes such that the number
of tokens in the second subset of tokens
matched by said token class is the least;

assigning said token class as part of the left
context pattern if said token class matches the left
context of the first token of the first subset of
tokens; and

assigning said token class as part of the right
context pattern if said token class matches the right
context of the first token of the first subset of
tokens.

22. The method of claim 21, further comprising the
steps of:

finding a third set of token classes that matches
an additional token of the left context of the first
token of the first subset of tokens that is not already
matched by the left context pattern or the
right context pattern;

finding a fourth set of token classes that matches
an additional token of the right context of the
first token of the first subset of tokens that is not
already matched by the left context pattern or
the right context pattern;

selecting a token class among the third set and
fourth set of token classes such that the number
of tokens in the second subset of tokens
matched by said token class is the least;

adding said token class as part of the left context
pattern if said token class matches the left
context of the first token of the first subset of
tokens; and

adding said token class as part of the right context
pattern if said token class matches the right
context of the first token of the first subset of
tokens.
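
The selection steps of claims 21 and 22 can be read as a greedy loop: at each step, look at the next unmatched token on either side of the first token, and add the token class that still matches the fewest tokens of the second subset. The sketch below is one possible reading under stated assumptions, not the claimed method itself: `grow_rule` is a hypothetical name, each token is given as a set of candidate token classes, ties are broken arbitrarily (left before right, then alphabetically), and the loop stops once no token of the second subset matches.

```python
from typing import List, Set, Tuple

Pattern = List[str]


def grow_rule(classes: List[Set[str]], pos: int,
              negatives: List[int]) -> Tuple[Pattern, Pattern]:
    """Greedily build (left, right) patterns around position `pos`.

    `negatives` are positions in the second subset that must not match.
    """
    left: Pattern = []    # left[k] constrains the token at pos - 1 - k
    right: Pattern = []   # right[k] constrains the token at pos + k
    remaining = list(negatives)

    def has(idx: int, cls: str) -> bool:
        return 0 <= idx < len(classes) and cls in classes[idx]

    while True:
        candidates = []
        li, ri = pos - 1 - len(left), pos + len(right)
        if li >= 0:  # next unmatched token of the left context
            for c in sorted(classes[li]):
                kept = [n for n in remaining if has(n - 1 - len(left), c)]
                candidates.append((len(kept), "L", c, kept))
        if ri < len(classes):  # next unmatched token of the right context
            for c in sorted(classes[ri]):
                kept = [n for n in remaining if has(n + len(right), c)]
                candidates.append((len(kept), "R", c, kept))
        if not candidates:
            break  # context exhausted
        # Pick the class that still matches the fewest negative tokens.
        _, side, cls, kept = min(candidates, key=lambda t: t[:3])
        (left if side == "L" else right).append(cls)
        remaining = kept
        if not remaining:
            break
    return list(reversed(left)), right


# Made-up sample: the first subset starts at position 5; positions 1-4 are negatives.
sample = [{"Html"}, {"Num"}, {"Html"}, {"Calph"}, {"Html"}, {"Num"}]
left_pattern, right_pattern = grow_rule(sample, 5, [1, 2, 3, 4])
```

In this sample the first pick is `Num` on the right (it matches one negative, against two for `Html` on the left), and two further left-context classes are then needed to exclude the remaining negative at position 1.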






23. A method of generating a set of contextual rules
used by an extractor for identifying the transitions
between subsets of tokens having different attributes,
the subsets of tokens being grouped in a
plurality of entries, comprising the steps of:

receiving a sample token sequence;

identifying each subset of tokens with corresponding
attribute names;

grouping the subsets of tokens into separate
entries; and

generating a contextual rule for each possible
transition between attributes;

wherein said contextual rule includes comparing
the left and right context of a token to a pair
of predetermined left and right context patterns,
and a transition is identified when a match is
found.

24. A system for generating contextual rules for identifying
attributes in a token sequence, comprising:

a memory for storing a sample token sequence;

a user interface for allowing a user to identify
at least one token within the sample token sequence
that identifies the transition from a first
attribute to a second attribute;

a processor for receiving the sample token sequence
and generating a pair of left and right
context patterns that match the left and right
context of said at least one token that identifies
the transition of attributes, whereby the left and
right context patterns form the matching criteria
of a contextual rule for identifying the transition
from said first attribute to said second attribute.

25. A method of generating transition rules for a single-pass
extractor from a text sequence, the text sequence
having a plurality of entries, each one of the
plurality of entries having a set of attributes and a
set of labels corresponding to the set of attributes,
the method comprising the steps of:

(a) determining all possible permutations of attribute
sequence within an entry from the set of
attributes and set of labels in the text sequence;
and

(b) creating a database of transition rules which
identifies all possible transitions between attributes
determined by step (a), each transition
rule comprising the transitions from a first attribute
to a dummy attribute and then to a second
attribute.
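
Steps (a) and (b) of claim 25 can be illustrated with a short sketch. It assumes that the attribute orderings are simply those observed in labeled sample entries, uses the attribute names U, N, A and M from the abstract, and introduces `DUMMY` and `transition_rules` as hypothetical names; an in-memory list stands in for the database.

```python
from typing import Iterable, List, Set, Tuple

DUMMY = "_"  # hypothetical label for the dummy attribute between two attributes

Rule = Tuple[str, str, str]  # (first attribute, dummy attribute, second attribute)


def transition_rules(entries: Iterable[Tuple[str, ...]]) -> List[Rule]:
    """Collect every attribute-to-attribute transition seen in the entries."""
    rules: Set[Rule] = set()
    for seq in entries:
        # Each pair of consecutive attributes yields one transition rule
        # that passes through the dummy attribute.
        for first, second in zip(seq, seq[1:]):
            rules.add((first, DUMMY, second))
    return sorted(rules)


# Attribute orderings of three made-up sample entries.
observed = [("U", "N", "A", "M"), ("N", "A"), ("U", "N", "M")]
rules = transition_rules(observed)
```

Routing every transition through a dummy attribute gives the extractor a place to consume the tokens that lie between two attributes without assigning them to either one.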

26. A method of generating contextual rules for identifying
transitions from a first subset of tokens to a
second subset of tokens within a token sequence,
the method comprising the steps of:

(a) identifying the positive token samples within
the token sequence;

(b) identifying the negative token samples within
the token sequence; and

(c) applying a set-covering algorithm to generate
a set of left and right context patterns that
match the left and right context of all positive
token samples and do not match the left and right
context of any negative token sample.
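
Step (c) of claim 26 can be sketched with the standard greedy set-covering heuristic. The claim does not say which set-covering algorithm is used, so the greedy choice is an assumption, as are the one-token context width, the wildcard convention (`None`), and the names `matches` and `cover`.

```python
from typing import List, Optional, Tuple

Context = Tuple[str, str]                      # (class left of the boundary, class right of it)
Pattern = Tuple[Optional[str], Optional[str]]  # None acts as a wildcard


def matches(p: Pattern, c: Context) -> bool:
    return all(want is None or want == got for want, got in zip(p, c))


def cover(positives: List[Context], negatives: List[Context]) -> List[Pattern]:
    """Choose patterns that match every positive sample and no negative sample."""
    # Candidate patterns come from the positives; any that match a negative are dropped.
    candidates: List[Pattern] = []
    for l, r in positives:
        for p in ((l, None), (None, r), (l, r)):
            if p not in candidates and not any(matches(p, n) for n in negatives):
                candidates.append(p)

    chosen: List[Pattern] = []
    uncovered = list(positives)
    while uncovered:
        # Greedy step: take the pattern covering the most uncovered positives.
        best = max(candidates,
                   key=lambda p: sum(matches(p, c) for c in uncovered),
                   default=None)
        if best is None or not any(matches(best, c) for c in uncovered):
            raise ValueError("a positive sample cannot be separated from the negatives")
        chosen.append(best)
        uncovered = [c for c in uncovered if not matches(best, c)]
    return chosen


# Made-up positive and negative samples, given as (left class, right class).
positives = [("Html", "Num"), ("SPC", "Num"), ("Html", "Calph")]
negatives = [("Num", "SPC"), ("Calph", "Html"), ("SPC", "Calph")]
patterns = cover(positives, negatives)
```

The greedy heuristic does not guarantee the smallest possible set of patterns, but every pattern it returns satisfies the two conditions of step (c).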
Token_Type  Pattern              Description                              Length (byte)

CR          \r                   The carriage return character.           Len=1
NL          \n                   The new line character.                  Len=1
TAB                              The Tab character.                       Len=1
PUNC        + - / ? • *          Any symbol except A-Z, a-z, 0-9.         Len=1
SPC                              The space character.                     Len=1
Calph       A  ABC               All letters in an English word are       Len>=1
                                 capital letters.
C1alph      Abc                  All letters in an English word are       Len>=2
                                 lowercase letters except the first one.
Oalph       abc                  All letters in an English word are       Len>=1
                                 lowercase letters.
BIG5                             A string that begins with a Chinese      Len>=2
                                 word and ends with a Chinese word.
Num         1.23  1,234,567      A single digit or any numeric            Len>=1
            +1.23  +1,234,567    expression containing digits or "+",
            -1.23  -1,234,567    "-", "/", "*", ","; but a single "+",
            1*2+3  1/100         "-", "/", "*" or "," does not belong
                                 to this type.
Nalph       F16  16F  1F6        A string containing letters or digits    Len>=2
                                 that is neither an English word nor a
                                 numeric expression.
Html        <HTML>  </HTML>      Any html tag.
Link        <a  <img             Any hyperlink tag.

Fig. 4
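
The token types of Fig. 4 lend themselves to a small classifier. The sketch below is an approximation under stated assumptions: the regular expressions are this sketch's reading of the descriptions in the table, `NL` and `C1alph` are the spellings used here for the new line and initial-capital types, the BIG5 type is omitted, and `token_type` is a hypothetical helper name.

```python
import re

# Ordered (type, regex) pairs; the first pattern that matches the whole token wins.
_RULES = [
    ("CR", r"\r"),
    ("NL", r"\n"),
    ("TAB", r"\t"),
    ("SPC", r" "),
    ("Link", r"<(?i:a|img)\b[^>]*>"),       # hyperlink tags such as <a ...> and <img ...>
    ("Html", r"</?[A-Za-z][^>]*>"),         # any other html tag
    ("Calph", r"[A-Z]+"),                   # all capital letters
    ("C1alph", r"[A-Z][a-z]+"),             # lowercase except the first letter
    ("Oalph", r"[a-z]+"),                   # all lowercase letters
    ("Num", r"(?=.*\d)[0-9+\-*/.,]+"),      # numeric expression with at least one digit
    ("Nalph", r"[A-Za-z0-9]{2,}"),          # letters/digits, not a word or a number
    ("PUNC", r"[^A-Za-z0-9\s]"),            # any single remaining symbol
]
_COMPILED = [(name, re.compile(rx)) for name, rx in _RULES]


def token_type(tok: str) -> str:
    """Return the Fig. 4 token type of `tok`, or "UNKNOWN" if none applies."""
    for name, rx in _COMPILED:
        if rx.fullmatch(tok):
            return name
    return "UNKNOWN"


types = [token_type(t) for t in ["Abc", " ", "1/100", "?", "F16", "</HTML>"]]
```

The ordering of the rules matters: `Num` is tested before `Nalph` so that pure digit strings are numbers, and `PUNC` comes last so that a lone "+" or "-" is classified as punctuation rather than as a numeric expression.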